Abstract

Here I address the problem of combining the output of several detectors for the same feature of an
image. I show that if the detectors return likelihoods I can robustly combine their outputs. The
combination has the advantages that:

*  The confidences of the operators in their own reports are taken into account. Hence if an
operator is confident about the situation and the others are not then the reports of the confident
operator dominate the decision process.

*  A priori confidences in the different operators can be taken into account.

*  The work to combine N operators is linear in N.

This theory has been applied to the problem of boundary detection. Results from these tests are
presented here.

1.  Introduction

Often  in  computer  vision  one  has  a  task  to  do  such  as  deriving  the  boundaries  of  objects  in  an 
image  or  deriving  the  surface  orientation  of  objects  in  an  image.  Often  one  also  has  a  variety  of 
techniques  to  do  this  task.  For  boundary  detection  there  are  a  variety  of  techniques  from  classical 
edge detection literature [Ballard82] and the image segmentation literature, e.g. [Ohlander79].
For determining surface orientation there are techniques that derive surface orientation from
intensities [Horn70] and texture [Ikeuchi80] [Aloimonos85]. These techniques make certain
assumptions  about  the  structure  of  the  scene  that  produced  the  data.  Such  techniques  are  only 
reliable  when  their  assumptions  are  met.  Here  I  show  that  if  several  algorithms  return  likelihoods 
I  can  derive  from  them  the  correct  likelihood  when  at  least  one  of  the  algorithms’  assumptions  are 
met.  Thus  I  derive  an  algorithm  that  works  well  when  any  of  the  individual  algorithms  works 
well. 

The mathematics here were derived independently but are similar to the treatment in
[Good50] and [Good83], using different notation. To understand my results one must first
understand the meaning of likelihood.

2.  Likelihoods 

In  this  paper  I  call  the  assumptions  that  an  algorithm  makes  about  the  world  a  model.  Most 
models  for  computer  vision  problems  describe  how  configurations  in  the  real  world  generate 
observed  data.  Because  imaging  projects  away  information,  the  models  do  not  explicitly  state  how 
to  derive  the  configuration  of  the  real  world  from  the  sensor  data.  As  a  result,  graphics  problems 
are  considerably  easier  than  vision  problems.  Programs  can  generate  realistic  images  that  no 
program  can  analyse. 

Let $O$ be the observed data, $f$ a feature of the scene whose existence we are trying to
determine (like a boundary between two pixels) and $M$ a model. Many computer vision problems
can be reduced to finding the probability of the feature given the model and the data,
$P(f \mid O \wedge M)$. However most models for computer vision instead make it easy to compute
$P(O \mid f \wedge M)$. I call $P(O \mid f \wedge M)$ (inspired by the statistical literature) the
likelihood of $f$ given observed data $O$ under $M$. As an example assume $f$ is "the image has a
constant intensity before noise". $M$ says that the image has a normally distributed uncorrelated
(between pixels) number added to each pixel (the noise). Calculating $P(O \mid M \wedge f)$ is
straightforward (a function of the mean and variance of $O$).
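The constant-intensity example can be sketched in a few lines of Python. This is only an illustration under assumptions that the text does not state: the constant intensity and the noise standard deviation are taken as known, and the function name `constant_image_log_likelihood` is hypothetical. Because the pixel values are continuous, the quantity computed is the log of a density rather than of a probability. The sketch makes the dependence on the mean and variance of the observed data explicit.

```python
import math


def constant_image_log_likelihood(pixels, intensity, sigma):
    """Log density of `pixels` if each one is `intensity` plus N(0, sigma^2) noise.

    Assumes a known constant intensity and known sigma (both are assumptions
    made for this sketch, not part of the original text).
    """
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((x - mean) ** 2 for x in pixels) / n  # population variance
    # Identity: sum((x - intensity)^2) = n * variance + n * (mean - intensity)^2,
    # so the data enter only through their mean and variance.
    squared_error = n * variance + n * (mean - intensity) ** 2
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - squared_error / (2.0 * sigma ** 2)
```

Working with the logarithm avoids underflow when the number of pixels is large.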

A theorem of probability theory, Bayes' law, shows how to derive conditional probabilities for
features from likelihoods and prior probabilities. Bayes' law is shown in equation 1.

$$
P(f \mid O \wedge M) = \frac{P(O \mid f \wedge M)\,P(f \mid M)}{P(O \mid f \wedge M)\,P(f \mid M) + P(O \mid \neg f \wedge M)\,P(\neg f \mid M)} \tag{1}
$$

$f$ is the feature for which we have likelihoods. $M$ is the domain model we are using.
$P(O \mid f \wedge M)$ is the likelihood of $f$ under $M$ and $P(f \mid M)$ is the probability
under $M$ of $f$.
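Equation 1 translates directly into code. The following is a minimal sketch; the function name `posterior_from_likelihoods` and its argument names are hypothetical and chosen only for illustration.

```python
def posterior_from_likelihoods(lik_f, lik_not_f, prior_f):
    """Bayes' law for a true/false feature (equation 1).

    lik_f     -- likelihood of the data given the feature is present
    lik_not_f -- likelihood of the data given the feature is absent
    prior_f   -- prior probability that the feature is present
    """
    numerator = lik_f * prior_f
    denominator = numerator + lik_not_f * (1.0 - prior_f)
    if denominator == 0.0:
        raise ValueError("the observed data have zero probability under the model")
    return numerator / denominator
```

When the two likelihoods are equal the function returns the prior unchanged, which is the case of data that carry no information about the feature.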

For features that can take on several discrete mutually exclusive labels (rather than just true
and false) such as surface orientation (which can be a pair of angles to the nearest degree or "not
applicable" (at boundaries)) a more complex form of Bayes' law shown in equation 2 yields
conditional probabilities from likelihoods and priors.

$$
P(l \mid O \wedge M) = \frac{P(O \mid l \wedge M)\,P(l \mid M)}{\sum_{l' \in L(f)} P(O \mid l' \wedge M)\,P(l' \mid M)} \tag{2}
$$

$l$ is a label for feature $f$ and $L(f)$ is the set of all possible labels for feature $f$.
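For a feature with several labels, equation 2 amounts to multiplying each likelihood by its prior and normalizing. A small sketch, assuming the likelihoods and priors are held in dictionaries keyed by label (the function name `label_posteriors` and this data layout are illustrative choices, not taken from the text):

```python
def label_posteriors(likelihoods, priors):
    """Bayes' law over mutually exclusive labels (equation 2).

    likelihoods[label] -- likelihood of the data given that label
    priors[label]      -- prior probability of that label
    """
    joint = {label: likelihoods[label] * priors[label] for label in likelihoods}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("the observed data have zero probability under every label")
    return {label: value / total for label, value in joint.items()}
```

The likelihoods themselves need not sum to 1; only the returned posteriors do.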

Another important use for explicit likelihoods is in Markov random fields. Markov
random fields describe complex priors that can capture important information. Several people have
applied Markov random fields to vision problems [Geman84]. Likelihoods can be used in a Markov
random  field  formulation  to  derive  estimates  of  boundary  positions  [Marroquin85b]  [Chou87].  In 
[Sher86]  and  [Sher87]  I  discuss  algorithms  for  determining  likelihoods  of  boundaries. 

Let  us  call  an  algorithm  that  generates  likelihoods  a  likelihood  generator.  Different  models 
lead  to  different  likelihood  generators.  The  difference  between  two  likelihood  generators'  models 
can  be  a  single  constant  (such  as  the  assumed  standard  deviation  of  the  noise)  or  the  two  likelihood 
generators’  models  may  not  resemble  each  other  in  the  slightest. 

Consider likelihood generators $L_1$ and $L_2$ with models $M_1$ and $M_2$ and assume they both
determine probability distributions for the same feature. $L_1$ can be considered to return the
likelihood of a label $l$ for feature $f$ given observed data $O$ and the domain model $M_1$. Thus
$L_1$ calculates $P(O \mid f=l \wedge M_1)$. Also $L_2$ calculates $P(O \mid f=l \wedge M_2)$. A
useful combination of $L_1$ and $L_2$ is the likelihood detector that returns the likelihoods for
the case where $M_1$ or $M_2$ is true. Also the prior confidences one has in $M_1$ and $M_2$ should
be taken into account.

This paper studies deriving $P(O \mid f=l \wedge (M_1 \vee M_2))$. Note that if I can derive rules
for combining likelihoods for two different models then by applying the combination rules $N$
times, $N+1$ likelihoods are combined. Thus all that is needed is combination rules for two models.


3.  Combining  Likelihoods  From  Different  Models 

To combine likelihoods derived under $M_1$ and $M_2$ an examination of the structure and
interaction of the two models is necessary. $M_1$ and $M_2$ must have the same definition for the
feature being detected. If the feature is defined differently for $M_1$ and $M_2$ then $M_1$ and
$M_2$ are about different events, and the likelihoods can not be combined with the techniques
developed in this section.

Thus  the  likelihood  generated  by  an  occlusion  boundary  detector  can  not  be  combined  with 
the likelihood generated by a detector for boundaries within the image of an object (such as
corners internal to the image). A detector of the likelihood of heads on a coin flip can not be
combined  with  a  detector  of  the  likelihood  of  rain  outside  using  this  theory.  (However  easy  it  may 
be  using  standard  probability  theory.) 

If  the  labeling  of  a  feature  f  implies  a  labeling  for  another  feature  g  then  in  theory  one  can 
combine  a  f  detector  with  a  g  detector  by  using  the  g  detector  that  is  implied  by  the  f  detector.  As 
an  example  a  region  grower  could  be  combined  with  a  boundary  detector  since  the  position  of  the 
regions  implies  the  positions  of  the  boundaries. 


3.1.  Combining  Two  Likelihoods 

The formula for combining the likelihoods generated under $M_1$ and $M_2$ requires prior
knowledge. Necessary are the prior probabilities $P(M_1)$ and $P(M_2)$ that the domain models
$M_1$ and $M_2$ are correct as well as $P(M_1 \wedge M_2)$. Often $P(M_1 \wedge M_2) = 0$. When
this occurs the two models contradict each other. I call two such models disjoint because both can
not describe the situation simultaneously. If $M_1$ is a model with noise of standard deviation
$4 \pm \epsilon$ and $M_2$ is a model with noise of standard deviation $8 \pm \epsilon$ then their
assumptions contradict and $P(M_1 \wedge M_2) = 0$.

Prior probabilities for the feature labels under each model ($P(f=l \mid M_1)$ and
$P(f=l \mid M_2)$) are necessary. If $P(M_1 \wedge M_2) \neq 0$ then the prior probability of the
feature label under the conjunction of $M_1$ and $M_2$ ($P(f=l \mid M_1 \wedge M_2)$) and the
output of a likelihood generator for the conjunction of the two models
($P(O \mid f=l \wedge (M_1 \wedge M_2))$) are needed. If I have this prior information I can derive
$P(O \mid f=l \wedge (M_1 \vee M_2))$.

If I were to combine another model, $M_3$, with this combination I need the priors $P(M_3)$,
$P(f \mid M_3)$, $P(M_3 \wedge (M_1 \vee M_2))$ and $P(f \mid M_3 \wedge (M_1 \vee M_2))$. To add
on another model I need another 4 priors. Thus the number of prior probabilities to combine $n$
models is linear in $n$.

Thus all that is left is to derive the combination rule for likelihood generators given this prior
information. The derivation starts by applying the definition of conditional probability in
equation 3.

$$
P(O \mid f=l \wedge (M_1 \vee M_2)) = \frac{P(O \wedge f=l \wedge (M_1 \vee M_2))}{P(f=l \wedge (M_1 \vee M_2))} \tag{3}
$$


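The formula for probability of a disjunction is applied to the numerator and denominator in
equation 4.

$$
P(O \mid f=l \wedge (M_1 \vee M_2)) = \frac{P(O \wedge f=l \wedge M_1) + P(O \wedge f=l \wedge M_2) - P(O \wedge f=l \wedge M_1 \wedge M_2)}{P(f=l \wedge M_1) + P(f=l \wedge M_2) - P(f=l \wedge M_1 \wedge M_2)} \tag{4}
$$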


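In equation 5 the definition of conditional probability is applied again to the terms of the
numerator and the denominator.

$$
P(O \mid f=l \wedge (M_1 \vee M_2)) =
\frac{
\begin{aligned}
& P(O \mid f=l \wedge M_1)\,P(f=l \mid M_1)\,P(M_1) \\
& {} + P(O \mid f=l \wedge M_2)\,P(f=l \mid M_2)\,P(M_2) \\
& {} - P(O \mid f=l \wedge M_1 \wedge M_2)\,P(f=l \mid M_1 \wedge M_2)\,P(M_1 \wedge M_2)
\end{aligned}
}{
P(f=l \mid M_1)\,P(M_1) + P(f=l \mid M_2)\,P(M_2) - P(f=l \mid M_1 \wedge M_2)\,P(M_1 \wedge M_2)
} \tag{5}
$$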
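The general two-model rule of equation 5 can be written as a single function. This is a sketch only; the function name `combine_likelihoods` and the argument names are hypothetical, and the caller is assumed to supply all nine quantities the derivation calls for.

```python
def combine_likelihoods(lik1, lik2, lik12,
                        label_prior1, label_prior2, label_prior12,
                        p_m1, p_m2, p_m12):
    """Likelihood of the data for label l given that M1 or M2 holds (equation 5).

    lik1, lik2, lik12     -- likelihoods under M1, M2, and M1-and-M2
    label_prior1, 2, 12   -- prior of label l under M1, M2, and M1-and-M2
    p_m1, p_m2, p_m12     -- prior probabilities of M1, M2, and M1-and-M2
    """
    numerator = (lik1 * label_prior1 * p_m1
                 + lik2 * label_prior2 * p_m2
                 - lik12 * label_prior12 * p_m12)
    denominator = (label_prior1 * p_m1
                   + label_prior2 * p_m2
                   - label_prior12 * p_m12)
    if denominator == 0.0:
        raise ValueError("label l has zero prior probability under M1 or M2")
    return numerator / denominator
```

For disjoint models the caller passes `p_m12 = 0`, which removes the subtracted terms whatever values are given for `lik12` and `label_prior12`.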


Different assumptions allow different simplifications to be applied to the rule in equation 5.
If the two models are disjoint equation 5 reduces to equation 6.

$$
P(O \mid f=l \wedge (M_1 \vee M_2)) = \frac{P(O \mid f=l \wedge M_1)\,P(f=l \mid M_1)\,P(M_1) + P(O \mid f=l \wedge M_2)\,P(f=l \mid M_2)\,P(M_2)}{P(f=l \mid M_1)\,P(M_1) + P(f=l \mid M_2)\,P(M_2)} \tag{6}
$$


Another assumption that simplifies things considerably is the assumption that prior probabilities
for all feature labelings in all the models and combinations thereof are the same. I call this
assumption constancy of priors. When constancy of priors is assumed
$P(f=l \mid M_1) = P(f=l \mid M_2) = P(f=l \mid M_1 \wedge M_2)$. Making this assumption reduces
the number of priors that need to be determined. Since determining prior probabilities from a
model is sometimes a difficult task the constancy of priors is a useful simplification. With
constancy of priors equation 5 reduces to equation 7.

$$
P(O \mid f=l \wedge (M_1 \vee M_2)) = \frac{P(O \mid f=l \wedge M_1)\,P(M_1) + P(O \mid f=l \wedge M_2)\,P(M_2) - P(O \mid f=l \wedge M_1 \wedge M_2)\,P(M_1 \wedge M_2)}{P(M_1) + P(M_2) - P(M_1 \wedge M_2)} \tag{7}
$$


Equation 6 with constancy of priors reduces to equation 8.

$$
P(O \mid f=l \wedge (M_1 \vee M_2)) = \frac{P(O \mid f=l \wedge M_1)\,P(M_1) + P(O \mid f=l \wedge M_2)\,P(M_2)}{P(M_1) + P(M_2)} \tag{8}
$$


Thus  equation  8  describes  the  likelihood  combination  rule  with  disjoint  models  and  constancy  of 
priors. 
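With disjoint models and constancy of priors, the rule extends to any number of models as a probability-weighted average, and folding in one model at a time gives the same answer. The sketch below illustrates both forms; the function names are hypothetical, and the models are assumed to be pairwise disjoint with positive probabilities.

```python
def combine_disjoint(likelihoods, model_probs):
    """Equation 8 for any number of pairwise disjoint models."""
    total = sum(model_probs)
    if total <= 0.0:
        raise ValueError("at least one model needs positive probability")
    return sum(lik * p for lik, p in zip(likelihoods, model_probs)) / total


def combine_pairwise(likelihoods, model_probs):
    """Same result, built by applying the two-model rule once per added model."""
    combined, weight = likelihoods[0], model_probs[0]
    for lik, p in zip(likelihoods[1:], model_probs[1:]):
        # `weight` is the probability that one of the models combined so far applies.
        combined = (combined * weight + lik * p) / (weight + p)
        weight += p
    return combined
```

Each added model costs one multiply-add and one division, so the work grows linearly with the number of models.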


3.2.  Understanding  the  Likelihood  Combination  Rule 

The easiest incarnation of the likelihood combination rule to understand is the rule for
combining likelihoods from disjoint models given constancy of priors across models (equation 8).
Here the combined likelihood is the weighted average of the likelihoods from the individual models
weighted by the probabilities of the models applying. (The combined likelihood is the likelihood
given the disjunction of the models).

If models $M_1$ and $M_2$ are considered equally probable and the likelihoods returned by $M_1$'s
detector are considerably larger than those of $M_2$'s detector then the probabilities determined
from the combination of $M_1$ and $M_2$ are close to those determined from $M_1$. Thus a model with
large likelihoods determines the probabilities. To illustrate this principle consider an example.

Assume that a coin has been flipped $n+1$ times. The results of the first $n$ flips have been
reported. The task is to determine the probability of heads having been the result of the
$(n+1)$th flip. Consider the results of each coin flip independent. Let $M_1$ be the coin being
fair so that the probability of heads and tails is equal. Let $M_2$ be that the coin is biased with
the probability of heads being $\pi$ and tails $1-\pi$, with $\pi$ being a random choice with equal
probability between $p$ and $1-p$. Hence the coin is biased towards heads or tails with equal
probability but the bias is consistent between coin tosses. The probability of heads remains the
same for all coin tosses in both models. $M_1$ and $M_2$ are disjoint (the coin is either fair or
it isn't but not both) and the prior probability of a flip being heads or tails is the same for
both, .5.

Under $M_1$ the probability of each of the possible flips of $n+1$ coins is $2^{-n-1}$. Under $M_2$
the probability of $n+1$ flips of coins with $h$ heads and $t = n+1-h$ tails is:

$$
\tfrac{1}{2}\,p^{h}(1-p)^{t} + \tfrac{1}{2}\,p^{t}(1-p)^{h}
$$

Let $n=2$ and $p=.9$. Assume the first two flips are both heads. Let $H$ be "the third flip was
heads" and $T$ be "the third flip was tails." The likelihood of $H$ given the observed data is the
probability of all 3 flips being heads divided by the probability of the third flip being heads.
The likelihood of $T$ given the observed data is the probability of the first 2 being heads and the
3rd tails divided by the probability of the third flip being tails.

Under $M_1$ the probability of all 3 flips being heads is 0.125 and the probability of a flip being
heads is 0.5; thus the likelihood of $H$ is 0.25. The likelihood of $T$ is 0.25 by the same
reasoning. Applying Bayes' law to get the probability of $H$ under $M_1$ one derives a probability
of .5.

Under $M_2$ the probability of all 3 flips being heads is 0.365 and the probability of a flip being
heads is 0.5. Thus the likelihood of $H$ is 0.73. Under $M_2$ the probability of the first two being
heads and the third being tails is 0.045 and the probability of a flip being tails is 0.5. Thus the
likelihood of $T$ is 0.09. Applying Bayes' law under $M_2$ a probability of $H$ being 0.89 is
derived.

If $M_1$ and $M_2$ are considered equally probable then the combination of the likelihoods from
the two models is the average of the two likelihoods. Thus the likelihood of $H$ for this
combination is 0.49 and the likelihood of $T$ is 0.17 (likelihoods don't have to sum to 1). Bayes'
law combines these likelihoods to get 0.74 for the 3rd flip to be heads.
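The arithmetic of this example can be checked with a short script. It is a sketch of the worked example only; the function names are hypothetical, and the numbers $n=2$, $p=.9$ and the prior of .5 come from the text.

```python
def fair_prob(heads, tails):
    """P(sequence | M1): every sequence of fair flips is equally likely."""
    return 0.5 ** (heads + tails)


def biased_prob(heads, tails, p):
    """P(sequence | M2): the heads probability is p or 1 - p, each with probability 1/2."""
    return 0.5 * p ** heads * (1 - p) ** tails + 0.5 * p ** tails * (1 - p) ** heads


def next_flip_likelihoods(sequence_prob, heads, tails):
    """Likelihoods of 'next flip is H' and 'next flip is T' given the observed counts."""
    prior_h = prior_t = 0.5
    return (sequence_prob(heads + 1, tails) / prior_h,
            sequence_prob(heads, tails + 1) / prior_t)


def posterior_h(lik_h, lik_t, prior_h=0.5):
    """Bayes' law (equation 1) for the next flip being heads."""
    return lik_h * prior_h / (lik_h * prior_h + lik_t * (1 - prior_h))


heads, tails, p = 2, 0, 0.9  # the observed data are HH
m1_h, m1_t = next_flip_likelihoods(fair_prob, heads, tails)
m2_h, m2_t = next_flip_likelihoods(lambda h, t: biased_prob(h, t, p), heads, tails)
# Equation 8 with equally probable models is a plain average.
both_h, both_t = (m1_h + m2_h) / 2, (m1_t + m2_t) / 2

print(round(m1_h, 2), round(m1_t, 2), round(posterior_h(m1_h, m1_t), 2))        # 0.25 0.25 0.5
print(round(m2_h, 2), round(m2_t, 2), round(posterior_h(m2_h, m2_t), 2))        # 0.73 0.09 0.89
print(round(both_h, 2), round(both_t, 2), round(posterior_h(both_h, both_t), 2))  # 0.49 0.17 0.74
```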

The table in figure 1 describes combining various $M_2$'s with different values of $p$ with $M_1$
for the different combinations with $n=4$.


Observed     Combined with $M_1$   Likelihood of H      Likelihood of T      Probability of H
Coin Flips   or just $M_2$         p=.6      p=.9       p=.6      p=.9       p=.6      p=.9

HHHH         Just $M_2$            0.088     0.5905     0.0672    0.0657     0.567     0.8999
             Combined              0.07525   0.3265     0.06485   0.0641     0.537     0.8359
HHHT         Just $M_2$            0.0672    0.0657     0.0576    0.0081     0.5385    0.8902
             Combined              0.06485   0.0641     0.06005   0.0353     0.5192    0.6449
HHTT         Just $M_2$            0.0576    0.0081     0.0576    0.0081     0.5       0.5
             Combined              0.06005   0.0353     0.06005   0.0353     0.5       0.5
HTTT         Just $M_2$            0.0576    0.0081     0.0672    0.0657     0.4615    0.1098
             Combined              0.06005   0.0353     0.06485   0.0641     0.4808    0.3551
TTTT         Just $M_2$            0.0672    0.0657     0.088     0.5905     0.433     0.1001
             Combined              0.06485   0.0641     0.07525   0.3265     0.4629    0.1641

Figure 1: Result of likelihood combination rule
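The entries of figure 1 follow from the sequence probabilities given earlier, and the sketch below shows one way to regenerate them. The function names and the printed layout are hypothetical; the values $n=4$, $p=.6$, $p=.9$, equal model probabilities and the prior of .5 are those of the text.

```python
def biased_prob(heads, tails, p):
    """P(sequence | M2): the heads probability is p or 1 - p, each with probability 1/2."""
    return 0.5 * p ** heads * (1 - p) ** tails + 0.5 * p ** tails * (1 - p) ** heads


def table_row(heads, tails, p, combined):
    """Return (likelihood of H, likelihood of T, probability of H) for one table cell group."""
    flip_prior = 0.5
    lik_h = biased_prob(heads + 1, tails, p) / flip_prior
    lik_t = biased_prob(heads, tails + 1, p) / flip_prior
    if combined:
        # Under M1 every sequence of heads + tails + 1 flips has the same probability.
        fair_lik = 0.5 ** (heads + tails + 1) / flip_prior
        # Equation 8 with equally probable models.
        lik_h, lik_t = (fair_lik + lik_h) / 2, (fair_lik + lik_t) / 2
    # Equal priors for H and T cancel in Bayes' law.
    return lik_h, lik_t, lik_h / (lik_h + lik_t)


for heads in (4, 3, 2, 1, 0):
    observed = "H" * heads + "T" * (4 - heads)
    for combined in (False, True):
        for p in (0.6, 0.9):
            lik_h, lik_t, prob_h = table_row(heads, 4 - heads, p, combined)
            source = "Combined" if combined else "Just M2"
            print(f"{observed}  {source:8s}  p={p}  {lik_h:.5f}  {lik_t:.5f}  {prob_h:.4f}")
```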

Look at the probabilities with $p=.9$ when the observed data is HHHH. For this case the
observed data fits $M_2$ much better than $M_1$ and the probability from combining $M_1$ and $M_2$
is close to the probability resulting from using just $M_2$, .9. If we had a longer run of heads
the probability of future heads would approach exactly $M_2$'s prediction, .9. On the other hand if
we had a long run of equal numbers of heads and tails the probability of future heads would quickly
approach the prediction of $M_1$, .5. When the observed data is HHHT the observed data fits $M_1$
about as well as $M_2$ and the resulting probability is near the average of .5 predicted by $M_1$
and 0.8902 predicted by $M_2$. Thus when the observed data is a good fit for a particular model
(like $M_2$) the probabilities predicted by the combination are close to the probabilities
predicted by the fitted model. If two models fit about equally then the result is an average of the
probabilities¹.

4.  When  No  Model  Applies 

Given a set of likelihood generators and their models, using the evidence combination
described in section 3 we can get the likelihood for the feature labelings given that at least one
model applies. Thus if we have likelihoods of a boundary given models with the noise standard
deviations near to 4, 8 and 16 in them we can derive the likelihood of a boundary given the noise
standard deviation is near to 4 or 8 or 16 (no matter which). Thus we can derive the probability
distribution over feature labelings given that at least one of our models applies. However what we
are trying to derive is the physical probability distribution over the feature labelings. This is
the probability distribution over feature labels given the observed data (estimated by the long run
frequencies over the feature labels given the observed data). The problem is that there may be a
case where none of the models' assumptions is true. In the Venn diagram of figure 2 each set
represents the set of situations where a model's assumptions are true. The area marked NO MODEL is
the set of situations where all the models fail.

¹ However the feature that the decision theory predicts is not the average of the features
predicted under the two different models in general.


What  should  the  likelihood  of  a  feature  label  be  if  no  model  applies?  To  answer  this  question 
I  examine  the  companion  question  of  what  should  the  probability  of  a  feature  label  be  if  no  model 
applies.  Assume  a  prior  probability  for  the  label  is  available.  If  a  posterior  probability  is  different 
from  a  prior  probability  for  the  feature  then  information  has  been  added  to  get  the  posterior.  (Only 
information  can  justify  changing  from  the  prior.)  Since  having  no  model  means  intuitively  having 
no  information  then  the  posterior  should  be  the  same  as  the  prior.  If  and  only  if  the  likelihoods  of 
all  feature  labels  are  equal,  the  posterior  probability  is  the  same  as  the  prior.  Hence  the 
likelihoods  of  the  feature  labels  should  be  equal  for  any  particular  piece  of  observed  data.  In  this 
section I assume a prior probability distribution is available over feature labels. If no such
distribution is available an uninformative prior can be constructed [Frieden85].

To constrain the problem further, consider whether any piece of observed data should be more
probable than any other when no model applies. It seems unreasonable that one could conclude
that some observations are more probable than others without any model of how those observations
were produced. Hence all the likelihoods should be equal. This constraint is sufficient to
determine the likelihoods when no model applies. I think that this solution minimizes cross
entropy with the prior (since it returns the prior) [Johnson85].

To  derive  the  physical  probability  distribution  over  feature  labels,  the  "no  model”  likelihoods 
should  be  combined  with  the  likelihoods  derived  for  the  models.  The  probability  of  each  of  the 
models  and  their  combinations  must  have  been  available  to  use  the  combination  rules  from  section 
3.  Hence  the  probability  that  one  or  more  of  the  models  applies  is  known.  The  probability  of  no 
model is 1 minus that probability. The conjunction of some model applying and no model applying
has  0  probability.  Hence  combination  rule  6  can  be  applied  to  derive  the  likelihoods  under  any 
conditions  from  the  likelihoods  for  any  model  applying. 

As an example consider the problem of seeing HHHH and trying to derive the probability of a
fifth head given the equally likely choices that the coin is fair or is biased to .9 (biased either
for heads or tails with equal probability). The combined likelihood of $H$ is 0.3265 (from figure
1). The combined likelihood of $T$ is 0.0641. As an example, assume that the probability that the
assumptions of $M_1$ were true was 0.4 and similarly for $M_2$. Then 0.4 of the time we feel the
coin is fair, 0.4 of the time we feel it has been biased by 0.9, and 0.2 of the time we have no
model about what happened. The likelihood of HHHH under "NO MODEL" is .0625 regardless of $H$ or
$T$ (since the likelihoods of all 4 coin flip events are equal and must sum to 1). Combining the
"NO MODEL" likelihoods with the combined likelihoods from figure 1 gives likelihoods of 0.2737 for
$H$ and 0.06378 for $T$; the probability of $H$ from applying Bayes' law to these likelihoods is
0.811. This probability is somewhat nearer to .5 than the probability of 0.8359 derived without
taking the possibility of all the models failing into account.
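The numbers in this example can be reproduced with a few lines of Python. This is a sketch; the function names are hypothetical, while the likelihoods 0.3265 and 0.0641, the probability 0.8 that some model applies, and the "NO MODEL" likelihood .0625 are taken from the text.

```python
def add_no_model(model_likelihoods, p_some_model, no_model_likelihood):
    """Mix the 'some model applies' likelihoods with a flat 'no model' likelihood.

    This is the disjoint rule with equal label priors: the two cases are
    weighted by their probabilities, which sum to 1.
    """
    p_no_model = 1.0 - p_some_model
    return {label: p_some_model * lik + p_no_model * no_model_likelihood
            for label, lik in model_likelihoods.items()}


def posterior(likelihoods, priors):
    """Bayes' law over labels (equation 2)."""
    joint = {label: likelihoods[label] * priors[label] for label in likelihoods}
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}


with_models = {"H": 0.3265, "T": 0.0641}       # from figure 1, observed HHHH, p = .9
everything = add_no_model(with_models, 0.8, 0.0625)
probability = posterior(everything, {"H": 0.5, "T": 0.5})

print(round(everything["H"], 4), round(everything["T"], 5), round(probability["H"], 3))
# 0.2737 0.06378 0.811
```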

Taking the possibility of all models failing into account lends certain good properties to the
system. Probabilities of 0 or 1 become impossible without priors of 0 or 1. Thus the system is
denied total certainty. Numbers near 0 or 1 cause singularities in the equations under finite
precision arithmetic. Total certainty represents a willingness to ignore all further evidence. I
find that property undesirable in a system. Denying the system total certainty also results in the
property that the system must keep every probability over feature labels between $\epsilon$ and
$1-\epsilon$ for an $\epsilon$ proportional to the probability that no model applies. Thus there is
a limit to how certain our system is about any feature labeling in our uncertain world.
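One way to see where such a bound comes from is sketched below. The derivation rests on assumptions that the text does not state explicitly: the observed data are discrete, so every likelihood is at most 1; the "no model" case is disjoint from the models; and constancy of priors holds. The symbols are introduced here only for illustration: $q$ is the probability that no model applies, $c$ is the common "no model" likelihood of the observed data, $\pi_l$ is the prior of label $l$, and $\Lambda_l$ is the likelihood for label $l$ given that some model applies.

```latex
\begin{aligned}
\Lambda'_l &= (1-q)\,\Lambda_l + q\,c
  && \text{(equation 8 applied to ``some model'' and ``no model'')} \\
P(l \mid O) &= \frac{\pi_l\,\Lambda'_l}{\sum_{l'} \pi_{l'}\,\Lambda'_{l'}}
  \;\ge\; \frac{\pi_l\,q\,c}{(1-q) + q\,c}
  \;\ge\; \pi_l\,q\,c
  && \text{(since } 0 \le \Lambda_{l'} \le 1 \text{ and } c \le 1\text{)} \\
P(l \mid O) &= 1 - \sum_{l' \ne l} P(l' \mid O)
  \;\le\; 1 - q\,c\,(1 - \pi_l)
\end{aligned}
```

Under these assumptions both bounds move away from 0 and 1 in proportion to $q$, and they collapse to 0 and 1 only when $q = 0$ or a prior is itself 0 or 1.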

5.  Results 

I  have  applied  this  evidence  combination  to  the  boundary  detection  likelihood  generators 
described  in  [Sher87].  Here  I  prove  my  claims  that  the  evidence  combination  theory  allows  me  to 
take  a  set  of  algorithms  that  are  effective  but  not  robust  and  derive  an  algorithm  that  is  robust. 
The  output  of  such  an  algorithm  is  almost  as  good  as  the  best  of  its  constituents  (the  algorithms 
that  are  combined). 

5.1.  Artificial  Images 

Artificial images were used to test the algorithms described in section 3 quantitatively. I used
as a source of likelihoods the routines described in [Sher87]. Because the positions of the
boundaries in an artificial image are known one can accurately measure false positive and negative
rates for different operators. Also one can construct artificial images to precise specifications.
The artificial image I use is composed of overlapping circles with constant intensity and
aliasing at the boundaries, shown in figure 3.

Figure 3: Artificial Test Image

The intensities of the circles were selected from a uniform distribution from 0 to 254. To the
circles was added normally distributed uncorrelated noise with standard deviations 4, 8, 12, 16,
20, and 32. The software to generate images of this form was built by Myra Van Inwegen working
under my direction. This software will be described in an upcoming technical report.

In figure 4 I show the result of applying the detector tuned to standard deviation 4 noise to
the artificial image with standard deviation 12 noise added to it. In figure 5 I show the result of
applying the detector tuned to standard deviation 12 noise to an image with standard deviation 12
noise added to it. In figure 6 I show the result of applying the combination of the detectors tuned
to 4, 8, 12, and 16 standard deviation noise. The combination rule was that for disjoint models
with the same priors. The 4 models were combined with equal probability. These operator outputs
are thresholded at 0.5 probability with black indicating an edge and white indicating no edge.
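The per-pixel processing described here (combine the detectors' likelihoods with the disjoint, constant-prior rule, convert to a probability, threshold at 0.5) might look like the sketch below. The data layout, the function name `combined_edge_map`, and the `edge_prior` parameter are assumptions made for illustration; the actual likelihood generators are those of [Sher87] and are not reproduced here.

```python
def combined_edge_map(edge_liks, no_edge_liks, model_probs, edge_prior, threshold=0.5):
    """Combine several detectors pixel by pixel and threshold the edge probability.

    edge_liks[k][r][c]    -- detector k's likelihood of the data given an edge at (r, c)
    no_edge_liks[k][r][c] -- detector k's likelihood of the data given no edge at (r, c)
    model_probs[k]        -- prior probability that detector k's model applies
    edge_prior            -- prior probability of an edge (hypothetical parameter)
    """
    total = sum(model_probs)
    rows, cols = len(edge_liks[0]), len(edge_liks[0][0])
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Equation 8: probability-weighted average over the disjoint models.
            lik_edge = sum(p * m[r][c] for p, m in zip(model_probs, edge_liks)) / total
            lik_none = sum(p * m[r][c] for p, m in zip(model_probs, no_edge_liks)) / total
            evidence = lik_edge * edge_prior + lik_none * (1.0 - edge_prior)
            # Equation 1; fall back to the prior if the data are impossible under every model.
            posterior = lik_edge * edge_prior / evidence if evidence > 0.0 else edge_prior
            row.append(posterior > threshold)
        result.append(row)
    return result
```

A detector whose two likelihoods at a pixel are both small contributes little to either average, so a confident detector decides the outcome at that pixel.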


a: Image with $\sigma=12$ noise    b: Output of $\sigma=4$ detector
Figure 4: $\sigma=4$ detector applied to an image with $\sigma=12$ noise


a: Image with $\sigma=12$ noise    b: Output of $\sigma=12$ detector
Figure 5: $\sigma=12$ detector applied to an image with $\sigma=12$ noise


a: Image with $\sigma=12$ noise    b: Output of combined detector
Figure 6: Combined detector applied to an image with $\sigma=12$ noise

Note that the result of using the combined operator is similar to that of the operator tuned to
the correct noise level. Most of the false boundaries found by the $\sigma=4$ operator are ignored
by the combined operator.

Using  this  artificial  image  I  have  acquired  statistics  about  the  behavior  of  the  combined 
detector  vs  the  tuned  ones  under  varying  levels  of  noise.  Figure  7  shows  the  false  positive  rate  for 
the detector tuned to standard deviation 4 noise as the noise in the image increases. Figure 8
shows the false positives for the standard deviation 12 operator. Figure 9 shows the false positive
rate for the operator tuned to the current standard deviation of the noise. Figure 10 shows the
false positive rate of the combined operator. Figure 11 shows the superposition of the 4 previous
graphs.

Figure 7: False positives vs noise $\sigma$ for operator tuned to $\sigma=4$ noise

Figure 8: False positives vs noise $\sigma$ for operator tuned to $\sigma=12$ noise

square: $\sigma=4$ operator    circle: $\sigma=12$ operator
triangle: tuned operator    cross: combined operator

Figure 11: False positives vs noise $\sigma$ for all operators

Note that the combined operator has a false positive rate that is at least as good as that of the
tuned  operators. 

I  can  also  count  false  negatives.  When  I  counted  false  negatives  I  ignored  missed  boundaries 
that had a boundary reported one pixel off normal to the boundary (because such an error is a
matter  of  discretization  rather  than  of  a  more  fundamental  sort).  See  figure  12  for  an  example  of  a 
1  pixel  off  error. 


MISS  GOOD 

MISS  is  recorded  as  a  false  negative 
GOOD  is  recorded  as  a  true  positive 

Figure  12:  Example  of  one  pixel  off  error 
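The counting of errors against a known boundary map can be sketched as follows. This is an approximation made for illustration, not the procedure actually used: a missed boundary is forgiven if any of its eight neighbours (or the pixel itself) is reported, whereas the text restricts the tolerance to one pixel off along the normal to the boundary. False positives are counted strictly, which is also an assumption. The function name `count_errors` is hypothetical.

```python
def count_errors(truth, detected):
    """Return (false positives, false negatives) for two boolean grids of equal size.

    A true boundary pixel is not counted as missed when a boundary is reported
    at it or at one of its eight neighbours (a simplification of the
    'one pixel off normal to the boundary' tolerance).
    """
    rows, cols = len(truth), len(truth[0])

    def reported_nearby(r, c):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and detected[rr][cc]:
                    return True
        return False

    false_pos = false_neg = 0
    for r in range(rows):
        for c in range(cols):
            if detected[r][c] and not truth[r][c]:
                false_pos += 1
            if truth[r][c] and not reported_nearby(r, c):
                false_neg += 1
    return false_pos, false_neg
```

Dividing the two counts by the number of non-boundary and boundary pixels respectively would turn them into rates.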

Figure 13 shows the false negative rate for the detector tuned to standard deviation 4 noise as
the noise in the image increases. Figure 14 shows the false negatives for the standard deviation 12
operator. Figure 15 shows the false negative rate for the operator tuned to the current standard
deviation of the noise. Figure 16 shows the false negative rate of the combined operator. Figure 17
shows the superposition of the 4 previous graphs.

square: $\sigma=4$ operator    circle: $\sigma=12$ operator
triangle: tuned operator    cross: combined operator

Figure 17: False negative rate for all operators

Here the combined operator is not always as good as the tuned operators. One must ask if
this tendency of the combined operator to miss edges offsets its better performance for false
positives. The next series of figures charts the total error rate for the same cases. Figure 18
shows the error rate for the detector tuned to standard deviation 4 noise as the noise in the image
increases. Figure 19 shows the error rate for the standard deviation 12 operator. Figure 20 shows
the error rate for the operator tuned to the current standard deviation of the noise. Figure 21
shows the error rate of the combined operator. Figure 22 shows the superposition of the 4 previous
graphs.

square: $\sigma=4$ operator    circle: $\sigma=12$ operator
triangle: tuned operator    cross: combined operator

Figure 22: Total errors by all the detectors


Thus the superiority of the combined operator for false positives dominates the false negative
performance and the combined operator minimises the number of errors in total. These results are
evidence that my combination rule is robust.


5.2.  Real  Images 

I have also tested these theories using two images taken by cameras. One of these images is
a tinkertoy image taken in our lab. The other is an aerial image of the vicinity of Lake Ontario.
Figure 23 shows the result of the operator tuned to standard deviation 4 noise applied to the
tinkertoy image and thresholded at 0.5 probability. Figure 24 shows the result of the operator tuned
to standard deviation 12 noise applied to the tinkertoy image. Figure 25 shows the effect of
combining operators tuned to standard deviation 4, 8, 12 and 16 with equal probability.
a: Tinkertoy Image b: Output of combined detector
Figure 25: Combined detector applied to tinkertoy image


Here, the result of the combined operator seems to be a cleaned up version of the standard
deviation 4 operator. Most of the features that are represented in the output of the combined
operator are, however, real features of the scene. The line running horizontally across the image
that the standard deviation 4 operator and the combined operator found is the place where the
table meets the curtain behind the tinkertoy. The standard deviation 4 operator was certain of its
interpretation and the other operators were uncertain at that point, so its interpretation was used
by the combination.
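
The way a confident operator comes to dominate can be illustrated numerically. The sketch below
assumes that the combined likelihood is the prior-weighted sum of the individual operators'
likelihoods and that the result is turned into a posterior with Bayes' rule. The function names,
the likelihood values and the prior edge probability of 0.2 are all made up for illustration and
are not taken from the experiments.

```python
def combine_likelihoods(likelihoods, weights):
    """Mix per-operator likelihood pairs into a single pair.

    likelihoods: list of (P(obs | edge, model_i), P(obs | no edge, model_i))
    weights:     prior probability of each model; assumed to sum to 1
    The cost is one pass over the operators, so it is linear in their number.
    """
    if len(likelihoods) != len(weights):
        raise ValueError("need one weight per operator")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    edge = sum(w * l_edge for w, (l_edge, _) in zip(weights, likelihoods))
    no_edge = sum(w * l_none for w, (_, l_none) in zip(weights, likelihoods))
    return edge, no_edge


def posterior_edge(edge_lik, no_edge_lik, prior_edge):
    """Bayes' rule: P(edge | obs) from the two likelihoods and the edge prior."""
    numerator = edge_lik * prior_edge
    return numerator / (numerator + no_edge_lik * (1.0 - prior_edge))


# Hypothetical likelihoods at one pixel for operators tuned to sigma = 4, 8, 12, 16.
# The sigma = 4 operator strongly favours an edge; the others cannot tell.
operators = [(0.08, 0.0004), (0.004, 0.004), (0.003, 0.003), (0.002, 0.002)]
weights = [0.25, 0.25, 0.25, 0.25]  # equal a priori confidence in each operator
PRIOR_EDGE = 0.2                    # assumed prior probability of an edge

edge_lik, no_edge_lik = combine_likelihoods(operators, weights)
combined = posterior_edge(edge_lik, no_edge_lik, PRIOR_EDGE)

# The same pixel without the confident operator, for comparison.
rest_edge, rest_none = combine_likelihoods(operators[1:], [1 / 3, 1 / 3, 1 / 3])
without_confident = posterior_edge(rest_edge, rest_none, PRIOR_EDGE)

print(combined, combined > 0.5)
print(without_confident)
```

With these made-up numbers the combined posterior comes out at about 0.70, so the pixel passes the
0.5 threshold even though three of the four operators are uninformative. Without the confident
operator the posterior stays at the prior of 0.2.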

The  results  from  the  aerial  image  are  also  instructive.  Figure  26  shows  the  result  of  the 
operator  tuned  to  standard  deviation  4  noise  applied  to  the  aerial  image  and  thresholded  at  0.5 
probability.  Figure  27  shows  the  result  of  the  operator  tuned  to  standard  deviation  12  noise 
applied  to  the  aerial  image.  Figure  28  shows  the  effect  of  combining  operators  tuned  to  standard 
deviation  4,  8,  12  and  16  with  equal  probability. 


a:  Aerial  Image  b:  Output  of  combined  detector 
Figure  28:  Combined  detector  applied  to  aerial  image 

The results from the combined operator are again a cleaned up version of the results from the
standard deviation 4 operator. I believe this behavior occurs again because the features being
found by the standard deviation 4 operator are in the scene. However, I do not have the ground
truth for the aerial image as I do for the tinkertoy image.

5.3.  Future  Experiments 

Soon, I will apply my evidence combination rules to operators that make different
assumptions about the expected image intensity histogram. The operator used so far in my
experiments expects a uniform histogram between 0 and 254. Currently, a likelihood generator
has been built that assumes a triangular distribution with the probability of an object having
intensity less than 128 being one fourth the probability of an object having intensity greater than
or equal to 128. It is not clear that the probabilities calculated based on this assumption will be
significantly different from those based on the uniform histogram assumption. If there is no
difference in the output of two operators, the effect of combination is invisible.

Larger operators will soon be available. The likelihoods generated based on these larger
operators would be finely tuned. The same evidence combination can be applied to these operators.

Likelihoods are used by Markov random field algorithms to determine posterior probabilities
[Marroquin85b] [Chou87]. Likelihoods resulting from my combination rules can be used by Markov
random field algorithms.

6.  Previous  Work 

Much of the work on evidence and evidence combination in vision has been on high level
vision. An important Bayesian approach (and a motivation for my work) was by Feldman and
Yakimovsky [Feldman74]. In this work Feldman and Yakimovsky were studying region merging
based on high level constraints. They first tried to find a probability distribution over the labels of
a region using characteristics such as mean color or texture. They then tried to improve these
distributions using labelings for the neighbors. Then they made merge decisions based on whether
it was sufficiently probable that two adjacent regions were the same.

Work with a similar flavor has been done by Hanson and Riseman. In [Hanson80] Bayesian
theories are applied to edge relaxation. This work had serious problems with its models and with the
fact that the initial probabilities input were edge strengths normalised never to exceed 1. Of
course such edge strengths have little relationship to probabilities (a good edge detector tries to be
monotonic in its output with probability, but that is about as far as it gets). In [Wesley82a] and
[Wesley82b] Dempster-Shafer evidence theory is used to model and understand high level problems
in vision, especially region labeling. In [Wesley82b] there is some informed criticism of Bayesian
approaches. In [Reynolds85] they study how one converts low level feature values into input for a
Dempster-Shafer evidence system.

In [Levitt85] Tod Levitt takes an approach to managing a hierarchical hypothesis space that
is Bayesian with some ad hoc assumptions. For the problem worked on here the paper would take
weighted sums of probabilities. He does not have any way of taking an operator's self confidence
into account in the evidence combination. Since he was not approaching this problem in his paper,
I cannot fault it in this respect.

There has been much use of likelihoods in recent vision work. In particular, work based on
Markov random fields [Geman84] [Marroquin85a] [Marroquin85b] uses likelihoods. A Markov
random field is a prior probability distribution for some feature of an image, and the likelihoods are
used to compute the marginal posterior probabilities that are used to update the field. Haralick has
mentioned that his facet model [Haralick84] [Haralick86b] can be easily used to build edge
detectors that return likelihoods [Haralick86a]. I also have built boundary detectors that return
likelihoods, and the results of using them are documented in [Sher87]. Paul Chou is using the
likelihoods I produce with Markov random fields for edge relaxation [Chou87]. He is also studying
the use of likelihoods for information fusion. Currently, he is concentrating on information fusion
from different sources of information.

7.  Conclusion 

I have presented a Bayesian technique for information fusion. I show how to fuse information
from detectors with different models. I presented results from applying these techniques to
artificial and real images.

These techniques take several operators that are tuned to work well when the scene has
certain particular properties and yield an algorithm that works almost as well as the best of the
operators being combined. Since most algorithms available for machine vision are erratic when
their assumptions are violated, this work can be used to improve the robustness of many
algorithms.